Man-Machine Interaction Using Speech

Author

  • David R. Hill
Abstract

…ed acoustic descriptions to an acoustically matching response. This is where the requirement for information at a higher level than the acoustic level arises, for out of all the pattern groupings that could be learned, only those that are meaningful are learned. Sutherland [145] argues the points involved in such a model of perception very cogently, for the visual case. We do not, at present, know enough about the process of speech perception to duplicate the feat. Sutherland's model for visual perception, which is still highly general at present, arose because, as a psychologist obtaining results from experiments on visual perception, he found that information from work on the neurophysiology of vision and on machine procedures for automatic pattern recognition (particularly Clowes [18, 19], but also well exemplified by Guzman [48, 49]) could be combined into a model of visual pattern recognition which offered great explanatory power in terms of the results obtained. A start on the explanation of speech perception may well involve a similar interdisciplinary approach, based on the same combination of psychology, machine methods, and neurophysiology. An appreciation of the need for such an explanation, on the part of those attempting machine perception of speech, and an appreciation of the sources of relevant information, including related work in the field of visual pattern recognition, is probably essential to real progress in machine descriptions of speech patterns, of which the feature extraction problem is but part.

(d) Models and Analysis

The development within this subsection will follow, as far as possible, the lines of the section on models and synthesis [Section 4.2.2(b)], in an attempt to show the relationships and to illustrate more clearly some of the unexplored areas. The picture is complicated somewhat by the need to account for a time scale which can no longer be chosen arbitrarily, and which varies considerably; by the ingenious variations that have proved possible, in specific realizations, in describing components of the models involved; and by the fact that, whatever model is adopted, there ultimately must be some form of segmentation, which interferes with any attempt at clear distinctions between the approaches in the former terms. Also, physiologically based approaches are almost unrepresented.

The very simplest model of speech, then, assumes a series of concatenated units. In the analysis case these may be interpreted as segments of the waveform which can be identified. Since segmentation at some stage is inevitable, this approach hardly differs from that of waveform matching, except that, by trying to match shorter sections, the process should presumably be somewhat easier. An approach which identified phonemes on the basis of autocorrelation analysis of waveform segments would be in this category, but no significant devices appear in the literature. Such an approach would almost certainly require reliable direct segmentation. Other approaches to analysis attempt to retrieve the ingredients that would need to be used, by some particular model, in producing the unknown utterance. These ingredients may be used as descriptors of segments at particular levels in order to classify them. Almost always, segments compatible with phonemic analysis are used at some stage, although for convenience in computer analysis the raw data may initially consist of shorter segments arising from some fixed sampling scheme.
Early approaches to automatic speech recognition worked almost entirely in the domain of the acoustic analog, and the simplest version thereof. The approach is characterized by some form of spectrographic analysis, giving a two-dimensional array of data points representing energy intensity at some time and frequency. Time, frequency, and intensity are usually quantized to have a small number of discrete values, usually two for intensity (energy present or absent). Analysis thus produced a binary pattern which could be matched against stored patterns derived from known words, and a decision made as to which of the stored patterns most nearly resembled the unknown input, hence naming the input. This was the method of Sebestyen [130], Uhr and Vossler [154, 155], Purton [120] (who actually uses multi-tap autocorrelation analysis rather than filter analysis), Shearme [137], Balandis [3] (who actually uses a mechanical filter system), and Denes and Matthews [24], as well as others. A rather similar kind of analysis may be obtained in terms of zero-crossing interval density or reciprocal zero-crossing interval density, the latter being closely related to frequency analysis [127, 128, 132, 139].

The major difficulties with either analysis lie in the time and frequency variability of speech cues. Time normalization on a global basis (squashing or stretching the time scale to fit a standard measure) assumes, for example, that a longer utterance has parts that are all longer by the same percentage, which is not true. One simple way around the difficulty was adopted by Dudley and Balashek [28]. They integrated with respect to time for each of ten selected spectral patterns, and based their word decision on matching the ten-element "duration-of-occurrence" pattern for an unknown input against a master set. The more usual approach has been to "segment" the input in some way, usually into phonemic segments, so that a series of (supposedly significant) spectral patterns, or spectrally based phonetic elements, results. Segmentation is either indirect (segments beginning when a named pattern is first detected and ending when it is no longer detected) or it is based on significant spectral change, much as suggested by Fant [see Section 5.2.2(b)]. Segmentation is frequently two-stage in computer-based approaches, the short (10 msec) raw-data segments due to the sampling procedure being lumped to give the larger segments required. Vicens [158] describes one such scheme of great significance, using three stages, followed by "synchronization" of the input segments detected with the segments of "candidate" recognition possibilities stored in memory. His approach may be considered a "head-on" attack on the related problems of segmentation and time normalization, and is the only scheme with a demonstrated ability to handle words in connected speech. The work is important for other reasons as well. Other examples of the indirect approach include Olson and Belar [109], Bezdel [5], Fry [43], Denes [22] (who also built in specific linguistic knowledge for error correction), and Scarr [133] (who incorporated an interesting scheme for amplitude normalization of the input speech). Examples of the direct approach include Gold [45] (who included other segmentation clues as well), Ross [125] (who included some adaptation), Traum and Torre [150], and Sakai and Doshita [128]. Needless to say, either approach requires a further stage of recognition, to recognize words.
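The binary pattern-matching procedure described at the beginning of this passage can be sketched very simply. The following fragment (a minimal modern illustration, not a reconstruction of any of the cited systems) quantizes a time-frequency energy array to "energy present or absent" and names an unknown input after the stored template it most nearly resembles; the fixed pattern size, the single threshold, and the use of Hamming distance as the measure of resemblance are assumptions made for the sake of the example.

```python
import numpy as np

def binarize(energy, threshold):
    # Quantize intensity to two values: energy present (1) or absent (0).
    return (np.asarray(energy) >= threshold).astype(np.uint8)

def nearest_word(unknown, templates):
    # `templates` maps a word to a stored binary time-frequency pattern of the
    # same shape as `unknown`.  The word whose pattern disagrees with the input
    # in the fewest cells (smallest Hamming distance) names the input.
    best_word, best_distance = None, None
    for word, pattern in templates.items():
        distance = int(np.count_nonzero(unknown != pattern))
        if best_distance is None or distance < best_distance:
            best_word, best_distance = word, distance
    return best_word
```

The global time-normalization difficulty discussed above shows up here as the requirement that the unknown pattern and every template share the same fixed shape.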
Sakai and Doshita, and Bezdel, used mainly zero-crossing measurements in their schemes. This may be significant in building machines entirely from digital components. At this stage the different approaches become harder to disentangle. Individual segments, usually phonemic in character, may be described partly in terms of the kind of parameters used by a parametric resonance analog synthesizer, and partly in terms of measures derived from these parameters, or from spectral attributes. Otten [112] and Meeker [101] both proposed that time parameters in a formant vocoder system should be adequate for recognition, but the results of any such approach are not generally available. Forgie and Forgie approached vowel and fricative recognition on the basis of formant values, fricative spectra, and transient cues, which are related to such parameters, and were quite successful [39, 40]. They reported up to 93% correct on vowel recognition, and "equal to humans" on fricative recognition. Frick [42] pointed out the advisability of "not putting all the eggs in one basket," stating the M.I.T. Lincoln Laboratory aim at that time as being to define a set of cues which might be individually unreliable but which, in combination, could lead to a reliable decision. This really indicates the general feeling for the decade that followed. Such a philosophy clearly leads to combined approaches, which are still in vogue. Here, really, is the nub of the "feature extraction problem," which now centers around the question of which particular blend of which features is best suited to describing individual segments. The not too surprising conclusion, according to Reddy, is that it depends on what particular kind of segment one is trying to classify. Thus in Reddy's scheme [121], segments are broadly classified on the basis of intensity and zero-crossing data, and finer discrimination within categories is accomplished on the basis of relevant cues. This seems an important idea, spilling over from research on the problems of search in artificial intelligence (and is the one developed by Vicens). He emphasizes the point that much automatic speech recognition research has been directed at seeking a structure in terms of which the problems might be formulated, rather than seeking the refutation of a model or hypothesis. He further remarks that lack of adequate means for collecting and interacting with suitable data has held up this aspect of the work.
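The hierarchical idea in Reddy's scheme can be pictured as a two-stage classifier: a coarse category is assigned from intensity and zero-crossing data, and only the cues relevant to that category are then consulted. The categories, thresholds, and cue names in the sketch below are placeholders chosen purely for illustration; they are not the values or categories used by Reddy [121].

```python
def classify_segment(intensity, zero_crossings, cues):
    """Two-stage segment classification: broad class first, then class-specific cues."""
    # Stage 1: broad classification from intensity and zero-crossing data.
    if intensity < 0.05:
        return "silence"
    if zero_crossings > 2500:          # dense zero-crossings: fricative-like
        category = "fricative-like"
    elif intensity > 0.4:              # strong and low zero-crossing rate: vowel-like
        category = "vowel-like"
    else:
        category = "consonant-like"

    # Stage 2: finer discrimination using only the cues relevant to the category.
    if category == "vowel-like":
        return "front vowel" if cues["second_formant_hz"] > 1500 else "back vowel"
    if category == "fricative-like":
        return "sibilant" if cues["high_band_energy"] > cues["low_band_energy"] else "non-sibilant"
    return category
```

The point of the structure is that the expensive or unreliable cues are consulted only where they are actually discriminative.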
In 1951 Jakobson, Fant, and Halle published their Preliminaries to Speech Analysis. This work, recently reprinted for the eighth time [72], has a continuing importance for those working in fields of speech communication. The idea of distinctive features is based on the idea of "minimal distinctions," a term coined by Daniel Jones [74] to indicate that any lesser distinction (between two sound sequences in a language) would be inadequate to distinguish the sequences clearly. Distinctive features are thus tarred with the same brush as phonemes; they are functionally defined. In Preliminaries to Speech Analysis it is suggested that a minimal distinction faces a listener with a two-choice situation between polar values of a given attribute, or presence versus absence of some quality. There may be double or triple differences between some sound sequences. But the fact that there are interactions between adjacent phonetic elements in speech means that in practice a "minimal distinction" may not be confined to one phoneme. For example, English "keel" and "call" are distinguished by more than the distinctions between the medial vowel sounds; for instance, the initial velar stops are markedly different, but the difference is not phonemically significant in English (though it is in some languages). However, distinctive features form a ready-made set of binary descriptors for phonemes, which fact has had practical and psychological effects on those working towards machine recognition. In their book, Jakobson, Fant, and Halle describe the acoustic and articulatory correlates of their distinctive features, but in qualitative terms. They remark that use of distinctive features for acoustic analysis would make the analysis easier, and supply the most instructive (acoustic) correlates. The search for acoustic correlates is not yet ended, for rather the same reasons that phonemes have resisted description. Daniel Jones' discussion of minimal distinctions in [74] reveals some of the difficulties. Nevertheless, approaches in terms of "distinctive features," whether following the original set, or merely using the concept but instrumenting features more easily detected by machine, have been made, and the concept has had a systematizing effect.

One noteworthy scheme, which follows the original distinctive feature set closely, is that of Hughes and Hemdal [67]. Vowel recognition scores of 92%, and consonant recognition scores of 50%, giving over-all recognition scores for words comparable with those of listeners, are reported for a single speaker. Their results indicated that hopes of a nonadaptive recognition scheme for more than one speaker are ruled out. An earlier approach, based on the original distinctive features, was that due to Wiren and Stubbs [169]. Work by Bobrow and his colleagues [7, 8] is the only reported truly parametric approach, though using rather different parameters from those required for a resonance analog synthesizer. Various binary features are detected, on the basis of the outputs of a nineteen-channel spectrum analyzer, and these are then treated independently, as binary parameters of individual utterances. For each feature (which at any given time may be "1" or "0") the time dimension resulting from fixed sampling is collapsed, to give a sequence of single 1's and 0's (i.e., 101 rather than 1110001). Each feature sequence present votes for any recognition possibility in which it has appeared, the output comprising the most popular possibility. Two sets of features were used, a linguistically oriented set and an arbitrary set. Both performed comparably, but the latter set degraded more when the parameter patterns for one speaker were tried on another. Recognition scores of 95% were achieved for single speakers on a fifty-four word vocabulary. The developments reported in the second paper were concerned with the testing of some new features in conjunction with a vocabulary increased to 109 words. Recognition scores of 91-94% were achieved. This approach is a departure from the conventional approach of segmenting into phonemes, classifying the phonemes on some basis, and then recognizing words on the basis of phoneme strings. However, it throws away a certain amount of information about the relative timing of events.
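Because the collapsing and voting operations of the Bobrow scheme are described quite explicitly, they can be sketched directly. The data layout below (strings of 0/1 samples per feature for the unknown word, and a training table mapping each word to the collapsed sequences seen for each feature) is an assumption made to keep the example self-contained, not a description of the original implementation.

```python
from collections import Counter
from itertools import groupby

def collapse(samples):
    # Collapse a fixed-rate binary feature track to its run pattern,
    # e.g. "1110001" -> "101": the time dimension is discarded.
    return "".join(value for value, _ in groupby(samples))

def recognize(feature_tracks, training_table):
    # Each collapsed feature sequence votes for every word in whose training
    # data it has appeared; the most popular candidate is the output.
    votes = Counter()
    for feature, samples in feature_tracks.items():
        sequence = collapse(samples)
        for word, sequences_by_feature in training_table.items():
            if sequence in sequences_by_feature.get(feature, set()):
                votes[word] += 1
    return votes.most_common(1)[0][0] if votes else None
```

The loss of information noted in the text is visible in `collapse`: once 1110001 becomes 101, nothing records how long each run lasted or how the runs of different features were timed relative to one another.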
Vicens' scheme carried out segmentation, primary classification, and detailed matching in terms of six parameters based on the amplitude and zero-crossing rates of the signals derived in three frequency bands related to vowel formant domains. It is thus, in a sense, a parametric approach, but is concerned with segments as descriptors (of units of meaning), so is not truly parametric. Recognition performance for isolated words and a single speaker (98% on a fifty-four word vocabulary in English, 97% on a seventy word vocabulary in French, and 92% on a 561 word vocabulary in English), and for words embedded in the connected utterances of strings of words in simple command language statements (96% on words, 85% for complete commands), is impressive. He makes a number of important points concerning ASR machines: that the features or parameters used are less important than the subsequent processing used; that accurate phoneme-like classification may be unnecessary in practical applications using a suitably restricted language; and that the techniques of artificial intelligence are important in ASR research. He also points out that Bobrow and his colleagues [7, 8] have an approach which does not extend easily to longer connected utterances in which a division into smaller segments (words or phonemes) is necessary, though this is clearly understood by those authors, who called their machine "LISPER" (Limited SPEech Recognizer). One reason is that information concerning the relative timing of feature sequences is not preserved.

Another new approach to the handling of the kind of information derived from binary feature extractors is described by this author in earlier papers [59, 60]. The approach is based on the notion of treating the start and end of binary feature occurrences as events. The input may then be described in terms of the occurrence and non-occurrence of various subsequences of these events, which provides a binary pattern output for decision taking. There are means of applying various constraints on what events are allowed, and what events are not allowed, in the subsequences. The scheme has certain general properties in common with that of Bobrow and his colleagues, but it specifically includes information about the relative timing of individual features. In this respect, it perhaps has something in common with the model of speech which regards speech as a mixture of events [see Section 4.2.2(h)]. It assumes the necessity of analyzing speech explicitly in terms of both content (events) and order, a distinction noted by Huggins [66] as long ago as 1953. The scheme (at the time of writing) has had no testing with a reasonable set of features. Experiments on an early prototype hardware version, using a feature set consisting of just the two features "presence of high frequency" and "presence of low frequency," gave recognition scores of 78% correct, 10% rejected, and 12% misrecognized, for a sixteen-word vocabulary of special words (chosen with the feature limitations in mind) and twelve unknown speakers [59]. In the new machine [60] the determination of subsequences is independent of any absolute time scale. As a consequence of the strategy used to achieve this time independence, the machine may be able to say what has occurred at some level of analysis, but it may be unable to say at all precisely when it occurred, since the irrelevant aspects of absolute time are discarded at an early stage in the processing. This failing it seems to share with the human [78].

The detection of subsequences is carried out by a "sequence detector." This is essentially an implementation of a simple grammar-based description machine, and operates in the dimensions of time and auditory primitives in a manner entirely analogous to the way, say, Guzman's figure description language operates in the dimensions of space and visual primitives (see Section 5.3 below). The bit pattern output provides a staticized commentary on the history of the input. By arranging for deactivation of those binary outputs from the sequence detector which lead to a particular decision at some level of meaning (initially words), or which are too long past to be relevant to the current situation, the system is readily extended to dealing with the recognition of successive parts of connected utterances.
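One reading of the event-based description just outlined can be sketched in two steps: feature onsets and offsets are extracted as an ordered event stream (absolute times being discarded once the ordering is fixed), and a bank of detectors each latches a bit when its particular subsequence of events has been observed. The event naming, the handling of simultaneous transitions, and the choice of non-contiguous subsequence matching below are assumptions of the sketch, not details taken from [59, 60].

```python
def event_stream(feature_tracks):
    # Turn binary feature tracks into an ordered stream of onset/offset events:
    # a 0 -> 1 transition of feature f becomes "+f", a 1 -> 0 transition "-f".
    transitions = []
    for feature, samples in feature_tracks.items():
        for i in range(1, len(samples)):
            if samples[i] != samples[i - 1]:
                sign = "+" if samples[i] == "1" else "-"
                transitions.append((i, sign + feature))
    transitions.sort()                           # order by time of occurrence...
    return [event for _, event in transitions]   # ...then discard absolute time

class SequenceDetector:
    # Latches a single output bit once its target subsequence of events has been
    # seen in order (not necessarily contiguously).  reset() deactivates the bit,
    # e.g. after it has contributed to a word decision or become stale.
    def __init__(self, subsequence):
        self.subsequence = list(subsequence)
        self.position = 0

    def feed(self, event):
        if self.position < len(self.subsequence) and event == self.subsequence[self.position]:
            self.position += 1
        return self.fired()

    def fired(self):
        return self.position == len(self.subsequence)

    def reset(self):
        self.position = 0
```

Feeding the event stream to a bank of such detectors, one per subsequence of interest, yields the kind of binary pattern output used for decision taking; note that only what happened, and in what order, survives, not when it happened.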
Other approaches, each of interest in its own way, can be listed; an exhaustive survey is out of the question. Dersch [25] described a recognizer based on waveform asymmetry, as a measure of voicing and as a crude vowel discriminant. Recognition depended on identification of a voiced segment, which was crudely classified and which could be preceded and/or followed by friction. A recognition rate of 90% on the ten digits is quoted, using both male and female speakers. Scores could be improved with practice. Martin et al. [94] describe a feature-based system using neural logic operating on spectral information. Following recognition of phoneme segments, words are recognized on the basis of particular phoneme sequence occurrences. Teacher et al. [147] describe a system aimed at economy. Segments are classified according to the quantized values of only three parameters: the Single Equivalent Formant (SEF), voicing, and amplitude. The SEF is reported as a multi-dimensional measure of vowel quality based on the zero-crossing interval under the first peak of the waveform segment in a pitch-synchronous analysis. This is similar to work by Scarr [132]. Tillman et al. [149] describe an approach based on features convenient to machine analysis, but linguistically oriented. There are five features and (due to redundancy) seven possible combinations of these, which lead to segments by indirect segmentation. Recognition is based on the occurrence of sequences of segments. Von Keller [159, 160] reports an approach which involves formant tracking as one of the required operations. Initial segmentation into voiced fricative, unvoiced fricative, plosive, nasal, and vowel-like is achieved on the basis of two spectral slope measures (over-all slope, and slope below 1 kHz), amplitude, rate of change in amplitude, and a nasal indicator. Part of the decision pattern consists of discrete features, referring to the segment pattern (somewhat like Dersch's machine), and part is continuous, based on key values of F1 and F2. Recognition rates around 95% are reported. Velichko and Zagoruyko [157] describe a scheme which recognizes segments on the basis of five frequency band parameters, and uses another form of segment synchronization (cf. Vicens, above) in the decision process. Single-speaker recognition scores of around 95% for a 203-word vocabulary are reported. There is one final study that falls in a class almost by itself. Hillix and his colleagues [61, 62] report the use of six nonacoustic measures in a speech recognizer. These measures were designed to reflect physiological descriptions of speech, and used special instrumentation. Single-speaker recognition rates were reported as 97%.
Recognition of one speaker, on the basis of data obtained from three others, gave recognition scores of 78-86%. Approaches based on physiological models of any kind are rare. It is difficult to manage the required characterization, or the alternative of direct measurement. Analysis at the physiological model level has been confined to basic speech research. As Jakobson, Fant, and Halle remark in their book, there is a hierarchy of relevance in speech transmission: perceptual, aural, acoustical, and articulatory. The last level is least relevant to the identification of sounds. There may, in fact, be more than one way of operating at the articulatory level to produce the same aural or perceptual effect. A good example of this is the perceptual similarity of the effect of lip rounding and pharyngealization.


Journal:
  • Advances in Computers

Volume 11

Publication year: 1971